The RooStatsCms (RSC) software framework supports analysis modelling and combination and statistical studies,
and gives access to sophisticated graphics routines for visualising the results. The goal of the project is to
complement the existing analyses through their combination and accurate statistical studies.



Soon the LHC machine will open a new, exciting era of measurement campaigns. A reliable and
widely accepted tool for combining analyses and performing statistical studies is needed, at first
for limit estimation. It should then provide the basis for discoveries and, finally, for parameter
measurements. Our proposed solution is RooStatsCms [1].

1. Statistical methods 

1.1. Separation of signal plus background and background only hypotheses

The analysis of search results can be formulated in terms of a hypothesis test. The first step
consists in identifying the observables which comprise the results, e.g. the number of signal events
observed or the cross section for a particular process. Then a test statistic is specified. A good
choice is $Q' = -2\ln Q$, where $Q = L_1/L_0$ and $L_0$ and $L_1$ are, for a given dataset, the
likelihoods computed in the $H_0$ (background only, b) and $H_1$ (signal plus background, sb)
hypotheses respectively. The last step is to define rules for discovery and exclusion. The quantity
$CL_s = CL_{sb}/CL_b$ is proposed in [2], where

$$CL_{sb} = P_{sb}(Q < Q_{obs}) = \int_{0}^{Q_{obs}} \frac{dP_{sb}}{dQ}\,dQ$$

and

$$CL_b = P_b(Q < Q_{obs}).$$

The quantity $dP_{sb}/dQ$ is the probability density function of the test statistic in the sb
experiments (see Fig. 1). The 95% confidence level (CL) exclusion of the signal hypothesis is usually
quoted when $CL_s \le 5\%$.
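The procedure above can be sketched for the simplest case, a single-bin counting experiment with known signal and background expectations. This is a minimal illustration, not the RSC implementation: the function names, the toy counts and the yields used in the usage note are assumptions made for the example. Because the observable is discrete, ties with the observed value are counted in the tail (Q <= Q_obs).

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def minus_two_ln_q(n, s, b):
    """-2 ln Q for a counting experiment, with Q = L(n | s+b) / L(n | b)."""
    return 2.0 * (s - n * math.log(1.0 + s / b))


def cls(n_obs, s, b, n_toys=10000, seed=1):
    """Return (CL_sb, CL_b, CL_s) estimated from toy experiments."""
    rng = random.Random(seed)
    q_obs = minus_two_ln_q(n_obs, s, b)
    q_sb = [minus_two_ln_q(poisson(rng, s + b), s, b) for _ in range(n_toys)]
    q_b = [minus_two_ln_q(poisson(rng, b), s, b) for _ in range(n_toys)]
    # Q <= Q_obs is equivalent to -2 ln Q >= -2 ln Q_obs.
    cl_sb = sum(q >= q_obs for q in q_sb) / n_toys
    cl_b = sum(q >= q_obs for q in q_b) / n_toys
    return cl_sb, cl_b, (cl_sb / cl_b if cl_b > 0 else float("inf"))
```

Calling `cls(n_obs=5, s=10.0, b=5.0)` with these illustrative yields gives a CL_s well below 5%, so such a signal hypothesis would be excluded at 95% CL.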



Figure 1. The distributions of $-2\ln Q$ (10000 toys) in the background only (red, on the right)
and signal plus background (blue, on the left) hypotheses. The black line represents the value of
$-2\ln Q$ on the measured data. The shaded areas represent $1-CL_b$ (red) and $CL_{sb}$ (blue).
The $-2\ln Q$ variable can be used instead of $Q$ to ease the calculations without any loss of
generality.



1.2. Profile likelihood (PL) 

Suppose we have the likelihood function $L(x, \theta)$, where $x$ are the quantities measured for
each event and $\theta$ the $K$ parameters of the joint pdf describing the data [3]. The idea of the
PL technique is to select a parameter $\theta_0$ and perform a scan over a sensible range of values.
For every point of the scan, the value of $\theta_0$ is fixed and the quantity $-\ln L$ (nll) is
minimised with respect to the remaining $K-1$ parameters $\theta_i$. The values are then plotted
versus $\theta_0$ (Fig. 2). Without any loss of generality, the negative log-likelihood scan is
usually considered in terms of $\Delta$nll, with its minimum value equal to 0: this is always
possible with a shift of all the scan points. The abscissa of the minimum of this curve is the
maximum likelihood estimator for $\theta_0$, $\hat{\theta}_0$. Finally, upper limits and intervals of
a given CL can be read directly from the PL curve by tracing horizontal lines at predefined vertical
values [4].
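As a worked check of where such predefined vertical values come from, the following assumes the asymptotic (Gaussian) regime, in which the shifted nll is parabolic around its minimum. That regime is an assumption of this note and is not stated explicitly in the text; $\sigma$ and $z_{CL}$ are symbols introduced here for the standard deviation of the estimator and the Gaussian quantile.

```latex
\Delta\mathrm{nll}(\theta_0) \simeq \frac{(\theta_0-\hat{\theta}_0)^2}{2\sigma^2}
\quad\Longrightarrow\quad
\Delta\mathrm{nll}_{CL} = \frac{z_{CL}^2}{2},
\qquad
\begin{aligned}
68.3\%\ \text{two-sided interval:}\quad & z_{CL} = 1 && \Rightarrow\ \Delta\mathrm{nll} = 0.5,\\
95\%\ \text{one-sided upper limit:}\quad & z_{CL} \simeq 1.645 && \Rightarrow\ \Delta\mathrm{nll} \simeq 1.35.
\end{aligned}
```

The second value is close to the threshold of about 1.36 used for the 95% CL upper limit.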
Figure 2. The likelihood scan in terms of $\Delta$nll and the horizontal line for the 95% CL upper
limit ($\Delta$nll $\approx$ 1.36). The interpolation between scan point pairs is linear, while the
minimum of the scan is the minimum of a parabola built on the lowest three points.
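The scan, the shift to a zero minimum, the three-point parabola and the linear interpolation at the 1.36 line can be sketched as follows. The model is an assumption chosen for illustration (one Poisson count with a Gaussian-constrained background as the only nuisance parameter), as are all function names and input values; it is not the RSC code. The sketch also assumes the lowest scan point is not at the edge of the scan range.

```python
import math


def nll(s, b, n_obs, b0, sigma_b):
    """Negative log-likelihood: Poisson(n_obs | s+b) times Gauss(b0 | b, sigma_b)."""
    mu = s + b
    return mu - n_obs * math.log(mu) + 0.5 * ((b - b0) / sigma_b) ** 2


def profiled_nll(s, n_obs, b0, sigma_b):
    """Minimise the nll over the nuisance parameter b (golden-section search)."""
    lo, hi = max(1e-6, b0 - 5.0 * sigma_b), b0 + 5.0 * sigma_b
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(100):
        c, d = hi - g * (hi - lo), lo + g * (hi - lo)
        if nll(s, c, n_obs, b0, sigma_b) < nll(s, d, n_obs, b0, sigma_b):
            hi = d
        else:
            lo = c
    return nll(s, 0.5 * (lo + hi), n_obs, b0, sigma_b)


def pl_upper_limit(n_obs, b0, sigma_b, s_max=30.0, step=0.5, level=1.36):
    """Return (s_hat, upper limit) read from an equally spaced profile scan."""
    xs = [i * step for i in range(int(s_max / step) + 1)]
    ys = [profiled_nll(x, n_obs, b0, sigma_b) for x in xs]
    i = min(range(1, len(xs) - 1), key=lambda j: ys[j])  # lowest interior point
    y0, y1, y2 = ys[i - 1], ys[i], ys[i + 1]
    curv = y2 - 2.0 * y1 + y0
    # Vertex of the parabola through the three lowest points.
    s_hat = xs[i] - step * (y2 - y0) / (2.0 * curv)
    y_min = y1 - (y2 - y0) ** 2 / (8.0 * curv)
    delta = [y - y_min for y in ys]  # shifted scan: minimum at 0
    for j in range(i, len(xs) - 1):
        if delta[j] < level <= delta[j + 1]:
            # Linear interpolation between the two scan points.
            frac = (level - delta[j]) / (delta[j + 1] - delta[j])
            return s_hat, xs[j] + frac * step
    return s_hat, None
```

For example, `pl_upper_limit(15, 5.0, 1.0)` returns an estimate of the signal yield near 10 together with the corresponding 95% CL upper limit.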



2. The RSC framework 

Based on the RooFit [5] technology, RSC is composed of three parts: analysis modelling and
combination, the implementation of statistical methods, and sophisticated graphics routines for
results visualisation. The framework is intended to be a product for the CMS collaboration, but a
public version, which does not include the modelling part, is available under the name
RooStatsKarlsruhe [6].

All the RSC classes inherit from the ROOT [7] TObject. This makes it possible to prototype macros
very quickly and run them interactively, and all the RSC objects can be written to disk in
rootfiles, exploiting the persistency mechanism implemented by ROOT.

The analyses and their combinations are described in ASCII configuration files in the "ini"
format: the datacards. The user first specifies the observable(s) involved in the analysis. Then
the signal and background components relative to the observable(s) are characterised: the framework
also allows multiple components to be considered for a single observable. The single components can
be the result of a counting analysis or be represented by shapes. Both cases can be taken into
account and, in the presence of a shape, the user can decide to describe it through a
parametrisation or to read it directly from a ROOT histogram stored in a rootfile. The yields of the
components can be described as a single quantity or as a product of many factors, e.g.
yield = $\mathcal{L} \cdot \sigma_{production} \cdot BR \cdot \epsilon$: the product of the
integrated luminosity, the production cross section, the branching ratio and the cuts efficiency.
The systematics on all the parameters present in the model description, together with their
correlations, can also be described in the datacard.
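To make the idea of an "ini" datacard concrete, here is a small sketch that parses one and evaluates a yield either as a single number or as a product of factors. The section names, key names and numerical values are hypothetical and chosen only for illustration. They are not the actual RSC datacard syntax, which is not reproduced in this text.

```python
import configparser

# Hypothetical datacard: names and numbers are illustrative only.
CARD = """
[observable]
name = mass

[signal]
# yield = luminosity * cross section * branching ratio * efficiency
luminosity = 1000.0
xsec = 0.5
br = 0.07
efficiency = 0.02

[background]
yield = 12.5
"""


def component_yield(section):
    """Use an explicit yield if present, else the product of the factors."""
    if "yield" in section:
        return section.getfloat("yield")
    result = 1.0
    for key in ("luminosity", "xsec", "br", "efficiency"):
        result *= section.getfloat(key)
    return result


def read_yields(text):
    """Parse the datacard text and return the yield of each component."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: component_yield(parser[name])
            for name in ("signal", "background")}
```

Keeping the factors separate in the card, rather than only their product, is what allows a systematic uncertainty to be attached to each of them individually.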

The RooFit objects representing the (combined) analyses are constructed by parsing and processing
the datacard. This approach has two advantages: it separates the analysis description from the
actual statistical studies, and it gives the analysis groups a common base for sharing their
results.

The calculations implied by certain statistical methods may be very CPU-intensive, e.g. if the
execution of many Monte Carlo toy experiments is foreseen. The RSC classes that implement those
methods are designed to ease the creation of multiple jobs intended for a batch system or the grid.
The outcome is collected in results objects which can be written to disk: to every statistical
method class corresponds a results class. Such objects can be collected again and added together,
thereby profiting from a very large number of Monte Carlo toy experiments.
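The pattern of one results object per job, added together afterwards, can be sketched as below. `ToyResults` and `merge` are hypothetical names introduced for this illustration, assuming a result is just two lists of test statistic values. They are not RSC classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyResults:
    """Hypothetical container for the test statistic values of one job."""
    sb_values: List[float] = field(default_factory=list)
    b_values: List[float] = field(default_factory=list)

    def __add__(self, other):
        # Adding two results concatenates their toy experiments.
        return ToyResults(self.sb_values + other.sb_values,
                          self.b_values + other.b_values)


def merge(results):
    """Add together the results objects collected from many jobs."""
    total = ToyResults()
    for partial in results:
        total = total + partial
    return total
```

Because addition only concatenates independent toys, the merged object can be analysed exactly as if all toys had been generated in a single job.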

Special care was devoted to the display of the results: for every results class, a plot class is
implemented. Therefore, in the final step of a statistical study, the user can get a plot object
out of a results object and draw it. These last two operations are straightforward and need only
two lines of C++ code. In addition to the plots directly linked to the results, RSC makes it
possible to build other widely accepted plots (see Fig. 3) with standalone classes.
Examples of plots produced with RSC are Fig. 1, Fig. 2 and Fig. 3.

A big effort was dedicated to the RSC documentation and examples. All the classes, methods,
members and namespaces are provided with Doxygen-style comments, and a RooStatsCms website is
available [1]. A wiki page is also present in the official CMS wiki [1]. The framework is
distributed with example ROOT macros, C++ programs ready to be compiled, and Python scripts to ease
the creation of the datacards.

A possible application of the RSC implementation of the statistical methods described in sections
1.1 and 1.2 is discussed in section 3.

3. A concrete application 

RSC was used for the CMS VBF $H \to \tau\tau$ analysis (1 fb$^{-1}$) study [8]. Given the
disadvantageous s/b ratio in the various mass hypotheses with this integrated luminosity, we put an
exclusion limit on the H boson production cross section using the technique described in section
1.1. Fig. 3 shows the expected H production cross section ($\sigma$) that could be excluded with the
data available, i.e. the Monte Carlo (MC) statistic available after all the cuts, in units of the
Standard Model (SM) $\sigma$. To derive the shape of the pdf for the two hypotheses, several toy MC
experiments were performed. To take the systematics into account, before the generation of each toy
MC dataset the values of the parameters affected by systematic uncertainties were fluctuated
according to their expected distribution, assumed in this case to be Gaussian.

Figure 3. Exclusion plot for the VBF $H \to \tau\tau$ CMS analysis. The $1\sigma$ and $2\sigma$
bands are obtained assuming a $1\sigma$ or $2\sigma$ upwards (downwards) fluctuation of the number
of observed background events. The plot is produced with a graphics class of RSC.
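The treatment of systematics in the toys can be sketched as follows: before each toy dataset is generated, the uncertain parameter is drawn from its Gaussian distribution, and only then is the count generated. The sketch assumes a single counting experiment whose background expectation is the only uncertain parameter. The function names, the truncation at zero and the numbers in the usage note are assumptions of this example, not details taken from the analysis.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def toy_counts(b_nominal, rel_unc, n_toys, seed=7):
    """Generate toy counts with a Gaussian-fluctuated background expectation."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_toys):
        # Fluctuate the nuisance parameter first (truncated at zero) ...
        b = max(0.0, rng.gauss(b_nominal, rel_unc * b_nominal))
        # ... then generate the toy dataset with the fluctuated value.
        counts.append(poisson(rng, b))
    return counts
```

With `toy_counts(20.0, 0.2, 5000)` the spread of the counts is visibly larger than with `rel_unc=0.0`, which is how the systematic uncertainty widens the test statistic distributions.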

By performing several toy MC experiments we also derived the distribution of the upper limits at
95% confidence level shown in Fig. 4, focusing on the number of signal events ($N_s$). In this
second study systematic uncertainties were not considered. An analysis of the coverage, i.e. the
fraction of cases in which the upper limit is indeed greater than the nominal $N_s$, was carried out
(Fig. 5). The $N_s$ value was also artificially incremented in order to study the coverage at
different working points. Overcoverage is present for low signal yields, leading to conservative
upper limits. This feature can be seen as a consequence of the Cramér-Fréchet inequality [4].
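A coverage study of this kind can be sketched for a counting experiment with a known background and no nuisance parameters, so that the scan is a plain likelihood scan. The threshold of 1.36 is the one quoted for the 95% CL upper limit. The function names, the constraint that the signal estimate is non-negative and the yields used in the usage note are assumptions of this illustration.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def delta_nll(s, n, b):
    """Shifted nll of Poisson(n | s+b), with the estimate bounded at s >= 0."""
    def nll(x):
        mu = x + b
        return mu - (n * math.log(mu) if n > 0 else 0.0)
    return nll(s) - nll(max(0.0, n - b))


def upper_limit(n, b, level=1.36):
    """Find by bisection where delta_nll crosses the level above the minimum."""
    lo = max(0.0, n - b)
    hi = lo + 10.0 * math.sqrt(n + 1.0) + 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if delta_nll(mid, n, b) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def coverage(s_true, b, n_toys=2000, seed=3):
    """Fraction of toys whose upper limit is at least the true signal yield."""
    rng = random.Random(seed)
    covered = sum(upper_limit(poisson(rng, s_true + b), b) >= s_true
                  for _ in range(n_toys))
    return covered / n_toys
```

Comparing `coverage(2.0, 5.0)` with `coverage(30.0, 5.0)` reproduces the qualitative behaviour described above: the coverage is well above the nominal 95% for the small signal yield and close to it for the large one.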

Figure 4. The distribution of the 95% CL upper limits on the number of signal events for
$m_H = 145$ GeV. The median is located at 10.71 signal events, which corresponds to a
$\sigma/\sigma_{SM}$ equal to 6.7.

Figure 5. The coverage in case of 68%, 84%, 90% and 95% CL (coloured lines) versus nominal $N_s$.
The reference coverages are shown with the dashed lines.

4. Conclusions

Two statistical analyses were carried out using the RSC framework. The hypotheses separation study
shows that it is possible to compute exclusion [...] test by means of the integrals of the pdfs of
the discriminating variable. The PL method can be used as well for extracting upper limits, but it
was verified that a significant overcoverage occurs in case of very small quantities.

RooStatsCms proved to be a reliable and valuable tool for analysis modelling and for deploying
complex statistical procedures in a simple and flexible way. The graphical routines embedded in the
package serve as a powerful tool for results visualisation.